Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


mkdir fastq
cd fastq
# Read one SRA run ID per line from ids.txt and extract its chr21 reads as FASTQ
while read -r id; do
    sam-dump \
        --verbose \
        --fastq \
        --aligned-region chr21 \
        --output-file "${id}_chr21.fastq" \
        "${id}"
done < ../ids.txt
cd ..

The download may take a long time, depending on the speed of the Internet connection and on the computer's memory and processors. The script above creates the “fastq” directory and saves in it one FASTQ file of chromosome 21 reads for each of the 13 human individuals.
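One quick way to confirm that each extraction finished is to count the reads in every output file. The sketch below assumes standard four-line FASTQ records (no wrapped sequence lines). The function name count_reads is made up for this illustration and is not part of the SRA Toolkit.

```shell
#!/usr/bin/env bash
# count_reads: hypothetical helper (not an SRA Toolkit command).
# ASSUMPTION: every FASTQ record occupies exactly four lines.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Report the number of reads in every chr21 FASTQ file, if any exist.
for fq in fastq/*_chr21.fastq; do
    [ -e "$fq" ] || continue   # skip when the glob matches nothing
    printf '%s\t%s\n' "$fq" "$(count_reads "$fq")"
done
```

A file that reports zero reads, or a fractional count, most likely points to an interrupted download and is worth extracting again.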

4.2.2.2.2 The reference genome

The FASTA sequence of the reference genome is required for read mapping. We can download it from a reliable database such as the NCBI Genome database. However, for the GATK pipeline, the sequence of the human genome can be downloaded from the GATK resource bundle, which is a collection of standard files prepared for use with GATK. The resource bundle is hosted on a Google Cloud bucket and can be accessed with a Google account at the following address:

https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/
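Objects in a public Google Cloud bucket can usually also be fetched over HTTPS, without the web console. The sketch below builds such a URL from the bucket path in the address above. The file name Homo_sapiens_assembly38.fasta, the storage.googleapis.com endpoint, and the helper names bundle_url and download_reference are assumptions that do not come from the text, so check them against the bucket listing before relying on them.

```shell
#!/usr/bin/env bash
# Bucket path taken from the console address above.
BUNDLE_PATH="genomics-public-data/resources/broad/hg38/v0"

# ASSUMPTION: name of the reference FASTA inside the bundle.
REF_FASTA="Homo_sapiens_assembly38.fasta"

# bundle_url: hypothetical helper that maps a file name to an HTTPS URL.
# ASSUMPTION: the bucket is publicly readable through storage.googleapis.com.
bundle_url() {
    printf 'https://storage.googleapis.com/%s/%s\n' "$BUNDLE_PATH" "$1"
}

# download_reference: hypothetical helper that fetches the FASTA into
# ./reference; -c resumes a partial download, -P sets the target directory.
download_reference() {
    mkdir -p reference
    wget -c -P reference "$(bundle_url "$REF_FASTA")"
}

# Print the URL only; call download_reference to start the large download.
bundle_url "$REF_FASTA"
```

Keeping the download inside a function means the URL can be inspected first, which matters because the reference FASTA is several gigabytes.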

TABLE 4.2 The NCBI SRA Run IDs and Individual Information

Run ID        Country     Population    Gender
ERR1019055    China       East Asia     Female
ERR1019056    China       East Asia     Male
ERR1019057    China       East Asia     Female
ERR1019081    Pakistan    South Asia    Male
ERR1025616    Pakistan    South Asia    Male
ERR1019044    Kenya       Africa        Female
ERR1025600    Kenya       Africa        Female
ERR1025621    Nigeria     Africa        Male
ERR1025640    Senegal     Africa        Male
ERR1019034    Russia      Europe        Male
ERR1019045    France      Europe        Male
ERR1019068    Italy       Europe        Male
ERR1025614    Bulgaria    Europe        Male
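The download loop reads its run IDs from ids.txt. A minimal sketch for building that file from the Run ID column of Table 4.2 follows, assuming the file holds one ID per line and is created in the directory that contains "fastq":

```shell
#!/usr/bin/env bash
# Write the 13 run IDs of Table 4.2 to ids.txt, one per line.
cat > ids.txt << 'EOF'
ERR1019055
ERR1019056
ERR1019057
ERR1019081
ERR1025616
ERR1019044
ERR1025600
ERR1025621
ERR1025640
ERR1019034
ERR1019045
ERR1019068
ERR1025614
EOF

# Sanity check: the line count should equal the number of individuals (13).
wc -l < ids.txt
```

Keeping the IDs in a separate file means the same list can be reused by later steps of the pipeline without editing each script.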